Learning in the Presence of Low-dimensional Structure: A Spiked Random Matrix Perspective
In the proportional asymptotic limit where the number of training examples $n$ and the dimensionality $d$ jointly diverge, $n,d\to\infty$ with $n/d\to\psi\in(0,\infty)$, we ask the following question: how large should the spike magnitude $\theta$ (i.e., the strength of the low-dimensional component) be in order for $(i)$ kernel methods and $(ii)$ neural networks optimized by gradient descent to learn $f_*$? We show that for kernel ridge regression, $\beta\ge 1-\frac{1}{p}$ is both sufficient and necessary, whereas for two-layer neural networks trained with gradient descent, $\beta> 1-\frac{1}{k}$ suffices. Our results demonstrate that both kernel methods and neural networks benefit from low-dimensional structure in the data; moreover, since $k\le p$ by definition, neural networks can adapt to such structure more effectively.
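As a concrete reading of these thresholds (an illustrative sketch: this excerpt does not restate the model definitions, so we assume the spike magnitude is parametrized as $\theta\asymp d^{\beta}$ and that $p$ and $k$ denote the degree and information exponent of the polynomial link function $\sigma_*$ defining $f_*$), consider a link whose lowest-order nonzero Hermite coefficient sits at index $2$ while its degree is $4$, e.g. a sum of probabilists' Hermite polynomials:
\[
\sigma_*(z)=\mathrm{He}_2(z)+\mathrm{He}_4(z)\ \Longrightarrow\ p=4,\ k=2:\qquad
\underbrace{\beta\ \ge\ 1-\tfrac{1}{4}=\tfrac{3}{4}}_{\text{kernel ridge regression}}
\qquad\text{vs.}\qquad
\underbrace{\beta\ >\ 1-\tfrac{1}{2}=\tfrac{1}{2}}_{\text{two-layer network, gradient descent}}.
\]
In this instance the network tolerates a substantially weaker spike ($\beta$ just above $\tfrac{1}{2}$) than the kernel method ($\beta\ge\tfrac{3}{4}$), illustrating the adaptivity gap implied by $k\le p$.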